Model Training - Classic Cross-Validation¶

Various combinations of model architectures, features, and training approaches were tested:

Model Architectures¶

  1. Baseline Model (avg): Predict the average 'K%' across a pitcher's available data.
  2. Baseline Model (last): Predict the last observed 'K%' from the pitcher's available data.
  3. Baseline Model (xK%): Predict 'K%' using the formula: xK% = -0.61 + (L/Str * 1.1538) + (S/Str * 1.4696) + (F/Str * 0.9417) (see The Definitive Pitcher Expected K% Formula).
  4. Linear Regression Model
  5. Random Forest Model
  6. XGBoost Model

Features¶

  1. All 67 Features:
    • 31 categorical one-hot encoded features for teams (30 MLB teams plus '---' for multi-team).
    • 36 numeric features.
  2. 7 Features Selected from Lasso Model (see 03-feature-engineering.ipynb):
    • 'numeric__Pit/PA'
    • 'numeric__Str%'
    • 'numeric__F/Str'
    • 'numeric__I/Str'
    • 'numeric__Con'
    • 'numeric__30%'
    • 'numeric__L/SO'

Training Approach¶

Two training schemes will be used for model training, with MSE (mean squared error) as the evaluation metric:

  1. Classical Cross-Validation: The data will be split into K-folds, with the goal of predicting 'K%'.
  2. Time-Series Cross-Validation: Folds will be based on the season, using earlier years (e.g., 2021) to predict subsequent years (e.g., 2022), and so on (e.g., 2021-2022 to predict 2023).

THIS NOTEBOOK USES THE CLASSICAL CROSS-VALIDATION APPROACH

  • For more details, refer to 02-data-partitioning.ipynb.
graph TD
    A["Player Pool"]
    A --> B["Training Pool"]

    subgraph TrainingFlow [" "]
        direction LR
        D["2021 --- 2022 --- 2023 "]
        F["X 2024"]:::red
    end
    B -- Training Flow --> D


    subgraph CVClassic ["Classic CV: All years used to predict K%"]
        FoldTitle1["Fold1"]:::noBorder
        FoldTitle2["Fold2"]:::noBorder
        FoldTitle3["Fold3"]:::noBorder

        Split1["Split1"]:::noBorder
        Fold1["Fold1"]:::blue
        Fold2["Fold2"]:::green
        Fold3["Fold3"]:::green

        Split2["Split2"]:::noBorder
        Fold4["Fold1"]:::green
        Fold5["Fold2"]:::blue
        Fold6["Fold3"]:::green

        Split3["Split3"]:::noBorder
        Fold7["Fold1"]:::green
        Fold8["Fold2"]:::green
        Fold9["Fold3"]:::blue

        FoldTitle1 ~~~ Fold1
        FoldTitle2 ~~~ Fold2
        FoldTitle3 ~~~ Fold3

        Split1 ~~~ Split2
        Split2 ~~~ Split3

        Fold1 ~~~ Fold4
        Fold2 ~~~ Fold5
        Fold3 ~~~ Fold6

        Fold4 ~~~ Fold7
        Fold5 ~~~ Fold8
        Fold6 ~~~ Fold9
    end

    TrainingFlow --> CVClassic

    classDef red fill:#FFCCCC,stroke:#FF0000,stroke-width:2px;
    classDef green fill:#CCFFCC,stroke:#00FF00,stroke-width:2px;
    classDef blue fill:#CCCCFF,stroke:#0000FF,stroke-width:2px;
    classDef noBorder fill:none,stroke:none,color:#000000;
    classDef transparent fill:#FFFFFF,stroke:#FFFFFF,stroke-width:2px,opacity:0;

Inspired by scikit-learn:

  • https://scikit-learn.org/stable/modules/cross_validation.html
  • https://scikit-learn.org/1.5/modules/cross_validation.html#time-series-split
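The two schemes map directly onto scikit-learn's splitters. As a minimal sketch on a toy array of "seasons" (illustrative data only, not the train set), `KFold` ignores time order while `TimeSeriesSplit` guarantees that every training fold precedes its test fold:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy season labels to illustrate the two schemes (hypothetical data).
seasons = np.array([2021, 2021, 2022, 2022, 2023, 2023, 2024, 2024])
X_toy = np.arange(len(seasons)).reshape(-1, 1)

# Classical CV: folds are arbitrary subsets; years mix freely across folds.
for train_idx, test_idx in KFold(n_splits=4).split(X_toy):
    print("KFold      train:", seasons[train_idx], "test:", seasons[test_idx])

# Time-series CV: each test fold is strictly later than its training data.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_toy):
    print("TimeSeries train:", seasons[train_idx], "test:", seasons[test_idx])
```

This notebook uses the `KFold`-style scheme (via `GridSearchCV(cv=5)` below); the season-based scheme is covered in the time-series notebook.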

Development Workflow¶

All functions and pipelines demonstrated in this notebook are defined in the bullpen.data_utils and bullpen.model_utils modules for clarity, reusability, and unit testing. While this notebook retains the initial development and intent of these functions, their inclusion here is primarily for transparency and ease of reference.

For production usage, refer to the source code in the bullpen.data_utils and bullpen.model_utils modules.

In [1]:
import joblib
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats
import xgboost as xgb
from sklearn import set_config
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

set_config(display="text")

from bullpen import data_utils, model_utils
In [2]:
train = pd.read_csv(data_utils.DATA_DIR.joinpath("train.csv"))
train
Out[2]:
PlayerId Team Season MLBAMID Name Age TBF K% Rk IP ... 02s 02h L/SO S/SO L/SO% 3pK 4pW PAu Pitu Stru
0 18655 ATL 2021 621345 A.J. Minter 27 221 0.257919 696 52.1 ... 44 7 11 46 0.193 11 4 0 0 0
1 18655 ATL 2022 621345 A.J. Minter 28 271 0.346863 649 70.0 ... 50 2 23 71 0.245 12 0 0 0 0
2 18655 ATL 2023 621345 A.J. Minter 29 260 0.315385 647 64.2 ... 40 4 13 69 0.159 8 1 0 0 0
3 19343 OAK 2022 640462 A.J. Puk 27 281 0.270463 773 66.1 ... 48 6 22 54 0.289 15 4 0 0 0
4 19343 MIA 2023 640462 A.J. Puk 28 242 0.322314 755 56.2 ... 42 6 22 56 0.282 16 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
983 1943 HOU 2021 425844 Zack Greinke 37 697 0.172166 417 171.0 ... 51 4 34 85 0.283 13 6 0 0 0
984 1943 KCR 2022 425844 Zack Greinke 38 585 0.124786 396 137.0 ... 39 3 22 51 0.301 7 2 0 0 0
985 1943 KCR 2023 425844 Zack Greinke 39 593 0.163575 353 142.1 ... 53 6 25 70 0.263 11 2 0 0 0
986 25918 STL 2022 668868 Zack Thompson 24 136 0.198529 967 34.2 ... 40 3 10 17 0.370 3 2 0 0 0
987 25918 STL 2023 668868 Zack Thompson 25 287 0.250871 942 66.1 ... 43 2 23 48 0.324 12 4 0 0 0

988 rows × 39 columns

In [3]:
target = "K%"
drop_cols = ["Name", "Rk", "PAu", "Pitu", "Stru", target]
lasso_features = [
    "Pit/PA",
    "Str%",
    "F/Str",
    "I/Str",
    "Con",
    "30%",
    "L/SO",
]

X_df = train[[c for c in train.columns if c not in drop_cols]]
X_df_lasso = train[lasso_features]

y_df = train[target]
In [4]:
X_df.head()
Out[4]:
PlayerId Team Season MLBAMID Age TBF IP PA Pit Pit/PA ... 30s 02% 02c 02s 02h L/SO S/SO L/SO% 3pK 4pW
0 18655 ATL 2021 621345 27 221 52.1 221 876 3.96 ... 6 0.290 64 44 7 11 46 0.193 11 4
1 18655 ATL 2022 621345 28 271 70.0 272 1111 4.08 ... 8 0.324 88 50 2 23 71 0.245 12 0
2 18655 ATL 2023 621345 29 260 64.2 260 1062 4.08 ... 5 0.273 71 40 4 13 69 0.159 8 1
3 19343 OAK 2022 640462 27 281 66.1 281 1072 3.81 ... 6 0.281 79 48 6 22 54 0.289 15 4
4 19343 MIA 2023 640462 28 242 56.2 242 950 3.93 ... 4 0.310 75 42 6 22 56 0.282 16 0

5 rows × 33 columns

In [5]:
X_df_lasso.head()
Out[5]:
Pit/PA Str% F/Str I/Str Con 30% L/SO
0 3.96 0.660 0.268 0.247 0.687 0.045 11
1 4.08 0.669 0.301 0.214 0.668 0.029 23
2 4.08 0.658 0.300 0.225 0.685 0.023 13
3 3.81 0.655 0.269 0.245 0.708 0.046 22
4 3.93 0.684 0.243 0.229 0.670 0.025 22
In [6]:
y_df.head()
Out[6]:
0    0.257919
1    0.346863
2    0.315385
3    0.270463
4    0.322314
Name: K%, dtype: float64
In [7]:
processor = model_utils.make_processing_pipeline(
    categorical_features=["Team"],
    numeric_features=[f for f in X_df.columns if f not in ["Team"]+drop_cols],
)

processor
Out[7]:
ColumnTransformer(transformers=[('categorical',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Team']),
                                ('numeric',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['PlayerId', 'Season', 'MLBAMID', 'Age', 'TBF',
                                  'IP', 'PA', 'Pit', 'Pit/PA', 'Str', 'Str%',
                                  'L/Str', 'S/Str', 'F/Str', 'I/Str', 'AS/Str',
                                  'I/Bll', 'AS/Pit', 'Con', '1st%', '30%',
                                  '30c', '30s', '02%', '02c', '02s', '02h',
                                  'L/SO', 'S/SO', 'L/SO%', ...])])
In [8]:
processor_lasso = model_utils.make_processing_pipeline(
    categorical_features=None,
    numeric_features=list(X_df_lasso.columns),
)

processor_lasso
Out[8]:
ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['Pit/PA', 'Str%', 'F/Str', 'I/Str', 'Con',
                                  '30%', 'L/SO'])])

Model Training¶

Baseline Estimators¶

Three baseline models were created to establish a reference point and evaluate how well simple heuristics could predict a player's strikeout percentage:

  1. Baseline Model (avg): Predict the average 'K%' across a pitcher's available data.
  2. Baseline Model (last): Predict the last observed 'K%' from a pitcher's available data.
  3. Baseline Model (xK%): Predict 'K%' using the formula: xK% = -0.61 + (L/Str * 1.1538) + (S/Str * 1.4696) + (F/Str * 0.9417) (see The Definitive Pitcher Expected K% Formula).

Any advanced model should outperform these simple heuristics.

In [9]:
results = {}
In [10]:
from sklearn.base import BaseEstimator, RegressorMixin


class Baseline(BaseEstimator, RegressorMixin):
    def __init__(self, method, grouper=None, target="K%"):
        self.method = method
        self.grouper = "PlayerId" if grouper is None else grouper
        self.target = target

    def __repr__(self):
        return f"{__class__.__name__}(method={self.method!r})"

    def fit(self, X, y):
        # Merge features and target for grouping
        data = pd.concat([X, y.rename(self.target)], axis=1)

        # Compute group-level predictions
        if self.method == "last":
            self.best_params_ = "return last seen K%"
            self.group_aggregates_ = (
                data.groupby(self.grouper)[self.target].last().rename("preds")
            )
        elif self.method == "mean":
            self.best_params_ = "return player avg K%"
            self.group_aggregates_ = (
                data.groupby(self.grouper)[self.target].mean().rename("preds")
            )
        else:
            raise ValueError(
                f"Invalid method {self.method!r}. Supported methods are 'last' and 'mean'."
            )
        self.fitted_ = True
        return self

    def predict(self, X):
        if not hasattr(self, "fitted_") or not self.fitted_:
            raise ValueError(
                f"This {self} instance is not fitted yet. Call 'fit' before using this method."
            )

        preds = X.merge(
            self.group_aggregates_,
            left_on=self.grouper,
            right_index=True,
            how="left",
        )

        if preds["preds"].isnull().any():
            raise ValueError("Some groups in X were not seen during fitting.")

        return preds["preds"].to_numpy()


class ArticleModel(BaseEstimator, RegressorMixin):
    """
    See https://fantasy.fangraphs.com/the-definitive-pitcher-expected-k-formula/.
    xK% = -0.61 + (L/Str * 1.1538) + (S/Str * 1.4696) + (F/Str * 0.9417)
    """

    def __repr__(self):
        return f"{__class__.__name__}()"

    def fit(self, X, y):
        self.best_params_ = "return xK% from article"
        self.preds_ = (
            -0.61
            + (X["L/Str"] * 1.1538)
            + (X["S/Str"] * 1.4696)
            + (X["F/Str"] * 0.9417)
        )
        self.fitted_ = True
        return self

    def predict(self, X):
        if not hasattr(self, "fitted_") or not self.fitted_:
            raise ValueError(
                f"This {self} instance is not fitted yet. Call 'fit' before using this method."
            )

        if self.preds_.isnull().any():
            raise ValueError("Some groups in X were not seen during fitting.")

        return self.preds_.to_numpy()


def train_baseline(model, X, y, results):
    model.fit(X, y)
    preds = model.predict(X)
    mse = mean_squared_error(y, preds)
    score = model.score(X, y)
    params = model.best_params_
    print(f"{model} {params=} {score=:.3f} {mse=:.5f}")
    results[repr(model)] = (score, mse)

    return preds, results
In [11]:
# Baseline 1 heuristic: mean of player performance
baseline_mean_preds, results = train_baseline(
    Baseline(method="mean", grouper="PlayerId", target="K%"),
    X_df,
    y_df,
    results,
)
Baseline(method='mean') params='return player avg K%' score=0.835 mse=0.00053
In [12]:
# Baseline 2 heuristic: last K% seen from a player
baseline_last_preds, results = train_baseline(
    Baseline(method="last", grouper="PlayerId", target="K%"),
    X_df,
    y_df,
    results,
)
Baseline(method='last') params='return last seen K%' score=0.673 mse=0.00104
In [13]:
# Baseline 3 heuristic: formula from article
baseline_article_preds, results = train_baseline(ArticleModel(), X_df, y_df, results)
ArticleModel() params='return xK% from article' score=0.876 mse=0.00040
In [14]:
results
Out[14]:
{"Baseline(method='mean')": (0.8346776623757628, 0.0005268612183991187),
 "Baseline(method='last')": (0.6733395134045793, 0.0010410253353765834),
 'ArticleModel()': (0.8758402930941082, 0.00039568116079508823)}

Non-Baseline Estimators¶

With the baseline models established, more advanced models are now introduced:

  1. Linear Regression Model
  2. Random Forest Model
  3. XGBoost Model

For each of these models, two different feature sets will be tested:

  1. All 67 Features:
    • 31 categorical one-hot encoded features for teams (30 MLB teams plus '---' for multi-team).
    • 36 numeric features.
  2. 7 Features Selected from the Lasso Model (see 03-feature-engineering.ipynb):
    • 'numeric__Pit/PA'
    • 'numeric__Str%'
    • 'numeric__F/Str'
    • 'numeric__I/Str'
    • 'numeric__Con'
    • 'numeric__30%'
    • 'numeric__L/SO'

NOTE: The baseline models do not use the feature sets at all; they rely on simple heuristics such as the mean or last observed value of the target ('K%'), and the ArticleModel uses only the three strike-ratio columns identified in the article. Testing the two feature sets (full vs. Lasso-selected) therefore does not apply to the baselines. For the remaining models, performance will be compared across both feature sets.

Grid Search Cross-Validation¶

For model tuning and hyperparameter optimization, we will use scikit-learn's GridSearchCV class. This class performs an exhaustive search over a specified parameter grid for an estimator, automatically testing all combinations of hyperparameters within the provided search space. This helps identify the optimal parameter values for a given model.

In addition to hyperparameter tuning, GridSearchCV also performs cross-validation for each set of parameters. Cross-validation involves partitioning the data into multiple subsets (folds), where the model is trained on some folds and tested on others, ensuring that the model's performance is evaluated across different data splits and reducing the risk of overfitting.

The cross-validation process in GridSearchCV works as follows:

  1. For each combination of hyperparameters, the model undergoes cross-validation.
  2. The performance metric (e.g., MSE) is averaged over the folds.
  3. The model with the best average performance is selected.

For more details on cross-validation, refer to scikit-learn's User Guide on Cross-validation.
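The fold-averaging in steps 1-3 can be inspected directly on a fitted search object via its `cv_results_` attribute. A minimal sketch on synthetic data (not the train set; `scoring="neg_mean_squared_error"` is used here because `GridSearchCV` maximizes its score, so MSE is negated):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem for illustration.
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

search = GridSearchCV(
    estimator=LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    scoring="neg_mean_squared_error",  # negated so that "higher is better"
    cv=5,
).fit(X_toy, y_toy)

# One mean test score per parameter combination, averaged over the 5 folds.
for params, mean_mse in zip(
    search.cv_results_["params"], -search.cv_results_["mean_test_score"]
):
    print(params, f"mean CV MSE = {mean_mse:.4f}")
print("selected:", search.best_params_)
```

The grid searches in this notebook use the default `scoring` of each estimator (R²); the mechanics are the same.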

Grid Search Cross-Validation
Source: scikit-learn User Guide

In [15]:
def train_model(processor, model, X, y, results, name):
    full_model = Pipeline(steps=[("processor", processor), ("regressor", model)])

    full_model.fit(X, y)
    preds = full_model.predict(X)
    mse = mean_squared_error(y, preds)
    score = full_model.score(X, y)
    params = full_model.named_steps["regressor"].best_params_
    print(f"{name} {params=} {score=:.3f} {mse=:.5f}")
    results[name] = (score, mse)

    return full_model, preds, results

Linear Regression¶

In [16]:
lr_full = GridSearchCV(
    estimator=LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    cv=5,
)

lr_model_full, lr_preds_full, results = train_model(
    processor,
    lr_full,
    X_df,
    y_df,
    results,
    name="LinearRegression(full)",
)
LinearRegression(full) params={'fit_intercept': True} score=0.959 mse=0.00013
In [17]:
lr_feature_impr_full = model_utils.sort_features_by_coefs(
    feature_names=processor.feature_names_in_,
    coefs=lr_full.best_estimator_.coef_,
    print_top_n=5,
)
AS/Pit 0.00656
PA -0.00521
I/Str -0.00429
02s 0.00339
Pit/PA 0.00329

Note¶

The linear regression model in this analysis shows good predictive power, but we need to check for multicollinearity. Multicollinearity occurs when two or more predictor variables are highly correlated, leading to redundancy in the model. This can make the model coefficients unstable and difficult to interpret.

To check for multicollinearity, we can calculate the Variance Inflation Factor (VIF) for each predictor. The VIF quantifies how much the variance of a regression coefficient is inflated by correlation with the other predictors in the model.
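The textbook definition can be sketched directly: VIF_i = 1 / (1 − R_i²), where R_i² comes from regressing predictor i on all the other predictors. Below is a minimal illustration on synthetic data (not the train set); note that statsmodels' `variance_inflation_factor` runs the auxiliary regression on the design matrix as given, without adding an intercept, so its values can differ from this intercept-including textbook form.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 0.95 + rng.normal(scale=0.3, size=200),  # nearly collinear with a
    "c": rng.normal(size=200),                        # independent predictor
})

def vif(df, col):
    """Textbook VIF: 1 / (1 - R^2) from regressing `col` on the other columns."""
    others = df.drop(columns=col)
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

# 'a' and 'b' should show inflated VIFs; 'c' should sit near 1.
for col in df.columns:
    print(col, round(vif(df, col), 2))
```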

In [18]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

tmp = X_df.copy().drop("Team", axis=1)
vifs = [
    variance_inflation_factor(tmp.to_numpy().astype(float), i)
    for i in range(tmp.shape[1])
]
zipped = zip(tmp.columns, vifs)
sorted(zipped)
/Users/logan/venvs/mlb-pitcher/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1784: RuntimeWarning: invalid value encountered in scalar divide
  return 1 - self.ssr/self.uncentered_tss
Out[18]:
[('02%', 329.9096730720909),
 ('02c', 718.0869608416566),
 ('02h', 9.454011376977292),
 ('02s', 172.36495143449014),
 ('1st%', 519.898262385167),
 ('30%', 63.61751268852434),
 ('30c', 308.83987712968843),
 ('30s', 106.42893117591737),
 ('3pK', 29.609573365093464),
 ('4pW', 29.402411899681486),
 ('AS/Pit', 203608.92280519314),
 ('AS/Str', 6053725.230209009),
 ('Age', 326.11879910498135),
 ('Con', 100417.1284296176),
 ('F/Str', 252718.59663713438),
 ('I/Bll', nan),
 ('I/Str', 236212.72494673784),
 ('IP', 1130.8468570572325),
 ('L/SO', 42.15273128656587),
 ('L/SO%', 40.341686523066365),
 ('L/Str', 539949.5773448445),
 ('MLBAMID', 589.9135517745392),
 ('PA', 251382.70047864868),
 ('Pit', 11385.71271543312),
 ('Pit/PA', 6065.3660600827425),
 ('PlayerId', 48.203827215922075),
 ('S/SO', 162.12277755920866),
 ('S/Str', 135146.29569806182),
 ('Season', 7975849.049458189),
 ('Str', 14645.588647056655),
 ('Str%', 211900.97687766183),
 ('TBF', 242534.7834724485)]

Note¶

What counts as a "good" Variance Inflation Factor (VIF) is somewhat subjective; common rules of thumb flag values above 5 or 10, and any VIF above 50 is clearly concerning. To illustrate, if we remove the highly correlated features and keep only the three key predictors from the article, The Definitive Pitcher Expected K% Formula, the VIF values drop dramatically, as demonstrated below. This suggests that the full feature set, with its high multicollinearity, is not advisable for the linear model; a smaller set of carefully chosen features improves stability and interpretability.

In [19]:
features = ["L/Str", "S/Str", "F/Str"]

vifs = [
    variance_inflation_factor(tmp[features].to_numpy().astype(float), i)
    for i in range(tmp[features].shape[1])
]
zipped = zip(tmp[features].columns, vifs)
sorted(zipped)
Out[19]:
[('F/Str', 31.280018739043413),
 ('L/Str', 25.45300401236043),
 ('S/Str', 17.603464261523573)]
In [20]:
lr_lasso = GridSearchCV(
    estimator=LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    cv=5,
)

lr_model_lasso, lr_preds_lasso, results = train_model(
    processor_lasso,
    lr_lasso,
    X_df_lasso,
    y_df,
    results,
    name="LinearRegression(lasso)",
)
LinearRegression(lasso) params={'fit_intercept': True} score=0.931 mse=0.00022
In [21]:
lr_feature_impr_lasso = model_utils.sort_features_by_coefs(
    feature_names=processor_lasso.feature_names_in_,
    coefs=lr_lasso.best_estimator_.coef_,
    print_top_n=7,
)
I/Str -0.05287
Pit/PA -0.01432
Con -0.01245
30% -0.00476
L/SO 0.00441
F/Str -0.0017
Str% -0.00035

Random Forest Regressor¶

In [22]:
rf_full = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]},
    cv=5,
)

rf_model_full, rf_preds_full, results = train_model(
    processor,
    rf_full,
    X_df,
    y_df,
    results,
    name="RandomForestRegressor(full)",
)
RandomForestRegressor(full) params={'max_depth': 15, 'n_estimators': 50} score=0.987 mse=0.00004
In [23]:
rf_feature_impr_full = model_utils.sort_features_by_coefs(
    feature_names=processor.feature_names_in_,
    coefs=rf_full.best_estimator_.feature_importances_,
    print_top_n=5,
)
3pK 0.00268
4pW 0.00082
AS/Pit 0.00056
PlayerId 0.00036
I/Str 0.00034
In [24]:
rf_lasso = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]},
    cv=5,
)

rf_model_lasso, rf_preds_lasso, results = train_model(
    processor_lasso,
    rf_lasso,
    X_df_lasso,
    y_df,
    results,
    name="RandomForestRegressor(lasso)",
)
RandomForestRegressor(lasso) params={'max_depth': 10, 'n_estimators': 100} score=0.984 mse=0.00005
In [25]:
rf_feature_impr_lasso = model_utils.sort_features_by_coefs(
    feature_names=processor_lasso.feature_names_in_,
    coefs=rf_lasso.best_estimator_.feature_importances_,
    print_top_n=7,
)
I/Str 0.85465
Con 0.05469
Pit/PA 0.03089
Str% 0.01685
30% 0.0153
L/SO 0.01524
F/Str 0.01239

XGBoost¶

From the XGBoost documentation:

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

In [26]:
xgb_full = GridSearchCV(
    estimator=xgb.XGBRegressor(),
    param_grid={"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]},
    cv=5,
)

xgb_model_full, xgb_preds_full, results = train_model(
    processor,
    xgb_full,
    X_df,
    y_df,
    results,
    name="XGBRegressor(full)",
)
XGBRegressor(full) params={'max_depth': 5, 'n_estimators': 100} score=1.000 mse=0.00000
In [27]:
xgb_feature_impr_full = model_utils.sort_features_by_coefs(
    feature_names=processor.feature_names_in_,
    coefs=xgb_full.best_estimator_.feature_importances_,
    print_top_n=5,
)
AS/Pit 0.00854
F/Str 0.00497
I/Str 0.00455
Str% 0.0039
S/SO 0.0036
In [28]:
xgb_lasso = GridSearchCV(
    estimator=xgb.XGBRegressor(),
    param_grid={"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]},
    cv=5,
)

xgb_model_lasso, xgb_preds_lasso, results = train_model(
    processor_lasso,
    xgb_lasso,
    X_df_lasso,
    y_df,
    results,
    name="XGBRegressor(lasso)",
)
XGBRegressor(lasso) params={'max_depth': 5, 'n_estimators': 25} score=0.978 mse=0.00007
In [29]:
xgb_feature_impr_lasso = model_utils.sort_features_by_coefs(
    feature_names=processor_lasso.feature_names_in_,
    coefs=xgb_lasso.best_estimator_.feature_importances_,
    print_top_n=7,
)
I/Str 0.70413
Con 0.21948
L/SO 0.02273
Str% 0.01692
Pit/PA 0.01621
30% 0.01163
F/Str 0.00891

Conclusion¶

  • For the tree-based models (XGBRegressor and RandomForestRegressor), including all available features outperformed using only the Lasso-selected features.
  • For the linear model (LinearRegression), the full feature set initially provided better performance than the Lasso-selected features. However, we observed that including all features led to high multicollinearity, which can undermine the stability of the model. Therefore, a more selective approach is advised for the linear regression model. For now, we will proceed with the seven Lasso-selected features (["Pit/PA", "Str%", "F/Str", "I/Str", "Con", "30%", "L/SO"]). That said, there is potential for further improvement, and additional feature engineering could be explored if time permits.
In [30]:
results_df = pd.DataFrame(results).T
results_df.columns = ["R2", "mse"]
results_df = results_df.sort_values("mse").round(6)
results_df
Out[30]:
R2 mse
XGBRegressor(full) 0.999775 0.000001
RandomForestRegressor(full) 0.987258 0.000041
RandomForestRegressor(lasso) 0.984362 0.000050
XGBRegressor(lasso) 0.978089 0.000070
LinearRegression(full) 0.959057 0.000130
LinearRegression(lasso) 0.930844 0.000220
ArticleModel() 0.875840 0.000396
Baseline(method='mean') 0.834678 0.000527
Baseline(method='last') 0.673340 0.001041

Plotting Results¶

In [31]:
lookup = data_utils.PlayerLookup()


def plot_pred_vs_target(X_df, y_df, preds, title, mode="static"):

    if mode == "static":
        plot_model = scipy.stats.linregress(preds, y_df)
        plt.scatter(preds, y_df, alpha=0.5)
        plt.plot(
            preds,
            plot_model.intercept + plot_model.slope * preds,
            "k-",
            label=f"r^2: {plot_model.rvalue**2:.3f}",
        )
        plt.xlabel("xK%")
        plt.ylabel("K%")
        plt.title(title)
        plt.legend()
        plt.show()

    if mode == "interactive":
        data = pd.concat(
            [X_df, y_df.rename("K%"), pd.Series(preds, name="xK%")],
            axis=1,
        ).merge(lookup.mapping, on=["MLBAMID", "PlayerId"])

        fig = px.scatter(
            data,
            x="xK%",
            y="K%",
            hover_data=["Name", "Team", "Season"],
            trendline="ols",
            height=600,
            width=600,
            title=title,
        )
        fig.show()


def plot_player(player_name, X_df, y_df, preds, target_year=2024, ylim=None):
    ylim = [0, 0.51] if ylim is None else ylim
    data = pd.concat(
        [X_df, y_df.rename("K%"), pd.Series(preds, name="xK%")],
        axis=1,
    ).merge(lookup.mapping, on=["MLBAMID", "PlayerId"])

    player_mask = data.Name == player_name
    mlb_id, fangraphs_id = data.loc[player_mask, ["MLBAMID", "PlayerId"]].iloc[0]
    seasons = data.loc[player_mask, "Season"].tolist()
    ks = data.loc[player_mask, "K%"].tolist()

    target_mask = player_mask & (data.Season == target_year)
    target = (
        data.loc[target_mask, "xK%"].item() if target_mask.sum() else 0.3
    )  # 0.3 is just a placeholder for missing data
    alpha = None if target else 0
    title = f"{player_name}\n(MLBAMID: {mlb_id} FanGraphs {fangraphs_id})"
    if not target:
        title = f"{title}\n NO TARGET DATA FOR {target_year}"

    fig, ax = plt.subplots()
    ax.plot(
        pd.to_datetime(seasons, format="%Y"),
        ks,
        marker="s",
        label="Prev Year(s) K%",
    )
    ax.scatter(
        pd.to_datetime(target_year, format="%Y"),
        target,
        marker="o",
        color="g",
        s=50,
        label=f"{target_year} xK%",
        alpha=alpha,
        zorder=99,
    )
    ax.set_ylim(ylim)
    ax.legend()
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
    ax.set_xlabel("Year")
    ax.set_ylabel("K%")
    ax.set_title(title)
    plt.show()
    print(f"xK%: {target:.4f}")
    print(f"K% : {ks}")


def find_delta_extrema(X_df, y_df, preds, extrema="max"):
    diffs = np.abs(y_df - preds)
    f = getattr(np, f"arg{extrema}")
    idx = f(diffs)
    mlb_id = X_df.iloc[idx].MLBAMID
    fangraphs_id = X_df.iloc[idx].PlayerId
    name = lookup.get_name_from_id(mlb_id)
    return name, mlb_id, fangraphs_id
In [32]:
plot_pred_vs_target(
    X_df,
    y_df,
    lr_preds_full,
    "LinearRegression(full)",
)
plot_pred_vs_target(
    X_df,
    y_df,
    lr_preds_lasso,
    "LinearRegression(lasso)",
)
In [33]:
plot_pred_vs_target(
    X_df,
    y_df,
    lr_preds_full,
    "LinearRegression(full)",
    mode="interactive",
)
plot_pred_vs_target(
    X_df,
    y_df,
    lr_preds_lasso,
    "LinearRegression(lasso)",
    mode="interactive",
)
loading player ids from /Users/logan/Desktop/repos/mlb-pitcher-xK/data/player_ids.json...
In [34]:
name, mlb_id, fangraphs_id = find_delta_extrema(
    X_df,
    y_df,
    lr_preds_lasso,
    extrema="max",
)
name, mlb_id, fangraphs_id
Out[34]:
('Liam Hendriks', 521230, 3548)
In [35]:
plot_player("Liam Hendriks", X_df, y_df, lr_preds_lasso, target_year=2021)
xK%: 0.3526
K% : [0.42322097, 0.36170213]
In [36]:
name, mlb_id, fangraphs_id = find_delta_extrema(
    X_df,
    y_df,
    lr_preds_lasso,
    extrema="min",
)
name, mlb_id, fangraphs_id
Out[36]:
('Framber Valdez', 664285, 17295)
In [37]:
plot_player("Framber Valdez", X_df, y_df, lr_preds_lasso, target_year=2022)
xK%: 0.2346
K% : [0.21853147, 0.23458283, 0.24752475]
In [38]:
plot_pred_vs_target(
    X_df,
    y_df,
    rf_preds_full,
    "RandomForestRegressor(full)",
)
plot_pred_vs_target(
    X_df,
    y_df,
    rf_preds_lasso,
    "RandomForestRegressor(lasso)",
)
In [39]:
plot_pred_vs_target(
    X_df,
    y_df,
    rf_preds_full,
    "RandomForestRegressor(full)",
    mode="interactive",
)
plot_pred_vs_target(
    X_df,
    y_df,
    rf_preds_lasso,
    "RandomForestRegressor(lasso)",
    mode="interactive",
)
In [40]:
plot_pred_vs_target(
    X_df,
    y_df,
    xgb_preds_full,
    "XGBoostRegressor(full)",
)
plot_pred_vs_target(
    X_df,
    y_df,
    xgb_preds_lasso,
    "XGBoostRegressor(lasso)",
)
In [41]:
plot_pred_vs_target(
    X_df,
    y_df,
    xgb_preds_full,
    "XGBoostRegressor(full)",
    mode="interactive",
)
plot_pred_vs_target(
    X_df,
    y_df,
    xgb_preds_lasso,
    "XGBoostRegressor(lasso)",
    mode="interactive",
)
In [42]:
name, mlb_id, fangraphs_id = find_delta_extrema(
    X_df,
    y_df,
    xgb_preds_lasso,
    extrema="max",
)
name, mlb_id, fangraphs_id
Out[42]:
('Daniel Duarte', 650960, 17480)
In [43]:
plot_player("Daniel Duarte", X_df, y_df, lr_preds_lasso, target_year=2023)
xK%: 0.2200
K% : [0.16911765]

Save Models¶

In [45]:
# joblib.dump(lr_model_lasso, model_utils.MODEL_DIR.joinpath("linear.joblib"))
# joblib.dump(rf_model_full, model_utils.MODEL_DIR.joinpath("randomforest.joblib"))
# joblib.dump(xgb_model_full, model_utils.MODEL_DIR.joinpath("xgboost.joblib"))